Integrated analysis of human DNA methylation, gene expression, and genomic variation in iMETHYL database using kernel tensor decomposition-based unsupervised feature extraction

Integrating gene expression, DNA methylation, and genomic variants simultaneously without location coincidence (i.e., irrespective of distance from each other) or pairwise coincidence (i.e., direct identification of triplets of gene expression, DNA methylation, and genomic variants, and not integration of pairwise coincidences) is difficult. In this study, we integrated gene expression, DNA methylation, and genome variants from the iMETHYL database using the recently proposed kernel tensor decomposition-based unsupervised feature extraction method with limited computational resources (i.e., short CPU time and small memory requirements). Our methods do not require prior knowledge of the subjects because they are fully unsupervised in that unsupervised tensor decomposition is used. The selected genes and genomic variants were significantly targeted by transcription factors that were biologically enriched in KEGG pathway terms as well as in the intra-related regulatory network. The proposed method is promising for integrated analyses of gene expression, methylation, and genomic variants with limited computational resources.


Introduction
The integrated analysis of multiomics datasets has always been difficult; in particular, integrating gene expression, DNA methylation, and genetic variants has rarely been successful [1,2]; in contrast, many studies integrate two of these three, that is, DNA methylation and genomic variants [3][4][5], gene expression and DNA methylation [6][7][8][9], and gene expression and genomic variants [10]. Although Seal et al. [2] successfully predicted gene expression from copy number variants (CNV) and DNA methylation, they did not discuss the relationship between CNV and DNA methylation. Therefore, they did not conduct a truly integrated analysis. Bell et al. [1] examined DNA methylation as a function of genetic and gene expression variation but did not directly investigate the relationship between gene expression and genetic variants; therefore, it was not a true integrated analysis.
In this study, we applied a recently proposed method [11] for the integrated analysis of gene expression, genetic variants, and DNA methylation using data retrieved from the iMETHYL database [12,13], without assuming any causal relationship between them in the framework of a purely data-driven strategy. Gene expression, methylation, and genetic variation shared patient-dependent patterns and were regulated by transcription factors. Enrichment analysis based on the genes targeted by these transcription factors is largely related to various biological functions.

Data set
The data set comprised gene expression, DNA methylation, and genomic variation profiles obtained from the same patients for each cell type (CD4 positive T cells: 99 patients, monocytes: 99 patients, neutrophils: 94 patients, for a total of 194 unique subjects. Venn diagram in Fig 1); 194 subjects common among these three measurements (i.e., gene expression, DNA methylation, and genomic variation) in one of three cell types were included in the analysis. The dataset analyzed in this study was obtained from the iMETHYL database after receiving approval from the Medical Ethics Committee of Iwate Medical University (approval no. HGH29-32) and the Ethics Committee of Chuo University (2019-6 and 2021-072).

Preprocessing
Fastq files obtained from RNA-seq were processed following the GTEx pipeline V8 [14] with slight modifications. Briefly, sequence reads were aligned to the GRCh37 human reference genome using STAR v2.5.0 [15], and bam files were generated.
Sequence reads obtained from whole-genome bisulfite sequencing were aligned using NovoAlign v3.02.08 (Novocraft Technologies, Sdn. Bhd., Selangor, Malaysia). The number of  converted and unconverted cytosines mapped to each CpG was counted using NovoMethyl v3.02.08 (Novocraft Technologies), and the proportion of unconverted cytosines was calculated as the DNA methylation level (%) [16]. Whole-genome sequence data were obtained from the Tohoku Medical Megabank Project [17], in which sequence reads were mapped onto the CRCh37 human reference genome using BWA-MEM [18] and variant calls were carried out using GATK v3.7 [19]. The resultant VCF files were further converted into the 012 format, where numeric variables ranging from 0 to 2 represent the number of non-reference alleles.

PLOS ONE
For the gene expression profiles, the bam files were converted into bed files using the bamtobed command. For gene expression and DNA methylation profiles, the bed files were separately integrated (summed or averaged) over every 25,000 nucleotide intervals, separately for 22 individual autosomes. Hereafter, these intervals are denoted as "genomic regions." Genetic variants were converted to numeric values (0-2) representing the number of non-reference alleles.

Tensor decomposition-based unsupervised feature extraction
We applied TD-based unsupervised FE optimized for multiomics data integration [11] to the dataset. Suppose that x i k jk 2 R N k �M�3 represents the values of the i k th components of the kth omics dataset for the jth subject. From these, we generated HOSVD [20] was applied to x jj 0 K resulting in where Gð' 1 ' 2 ' 3 Þ 2 R M�M�3 is a core tensor that represents a weight of the product u ' 1 j u ' 2 j0 u ' 3 k towards x jj 0 k . u ' 1 j ; u ' 2 j 0 2 R M�M and u ' 3 k 2 R 3�3 are singular value orthogonal matrices. u ' 1 j ¼ u ' 2 j0 when ℓ 1 = ℓ 2 and j = j 0 . After identifying u ' 1 j of interest and denoting a set of these ℓ 1 s as O ' 1 , we can derive the singular value vectors attributed to i k s by and attribute Pvalues to i k assuming that u ' 1 i k follows a Gaussian distribution (null hypothesis) where P w 2 [> x] is the cumulative χ 2 distribution whose argument is larger than x and s ' 1 is the standard deviation. P i k s were corrected using the BH criterion [20] and i k s with an adjusted P i k of less than 0.01 were selected.

Enrichment analysis
Enrichment analysis was performed using Enrichr software [23].

Transcription factor regulation analysis
Information on TF mutual regulation relations was retrieved from Regnetworkweb [24] and TRRUST2 [25].

Identification of TFBSs and genes associated with detected genetic variants
TFBSs and genes associated with genetic variation were identified using SNPnexus [26].

Identification of u ' 1 j s of interest
Since this study included only healthy individuals, we could not identify differentially expressed genes (DEG). Therefore, we employed a fully unsupervised strategy. Defining DEG is difficult using this strategy. Our criterion was as follows: seek u ' 1 j s that are common among distinct individual autosomes. If the 22 u ' s j s identified for individual autosomes share the same subject dependence j, it is unlikely to be accidental. Although there is some possibility that they reflect measurement bias; for example, if the total number of reads differs from subject to subject, this is very unlikely to be caused by measurement bias for the following reasons. First, as the present study was an integrated analysis of three omics measurements, the same patterns of subject measurement bias were unlikely to occur for the three omics datasets simultaneously because the experimental procedures differed from each other. Second, as u ' 1 j are orthogonal to each other for distinct ℓ 1 s, if more than two patterns of subject dependence are observed for more than one ℓs, none of them can be interpreted as measurement bias, which can result in only one unique pattern of subject dependence. Third, if subject-dependent patterns are caused by measurement bias, the selected genes based on these patterns may not be biologically reasonable; however, by applying enrichment analysis, the biological significance of the selected genes is easily validated. Table 1 lists the average absolute mutual correlation coefficients between the patterns attributed to 22 autosomes: where rðu chr ' 1 j ; u chr 0 ' 0 1 j Þ is the correlation coefficient between u chr ' 1 j and u chr 0 ' 0 1 j and u chr ' 1 j is u ' 1 j selected for chr-th autosome (1 � chr � 22). These factors are mutually correlated. To further validate the significance of the number of pairs among the total 22 × 21/2 = 231 pairs, we computed Pvalues and counted the number of pairs associated with significant correlations. Most pairs were significantly correlated ( Table 2). Hereafter this set of u ' 1 j is denoted as the "max correlated set." Next, we attempted to determine whether u ' 1 j s in the "max correlated set" correlated with the clinical data (Table 3). Unfortunately ℓ 1 s selected most frequently in Table 1 for "max correlated set," u ' 1 j s with ℓ 1 = 2, are not correlated with clinical data. Thus, we decided to select u ' 1 j s with ℓ 1 = 3 for CD4 + T cells and neutrophils as additional singular value vectors of interest (Those for monocytes were not selected because they did not correlate with the clinical data in Table 3). Table 1 also shows the mutual correlation coefficients between the patterns attributed to the 22 autosomes and the associated and corrected P-values for u ' 1 j s with ℓ 1 = 3. Although the correlations were less than those in the "max correlated set," they were more or less significant (Table 1), as the majority of the 231 pairs were significantly correlated (Table 3). Thus, we decided to employ u ' 1 j s with ℓ 1 = 3 for the downstream analyses. Hereafter, these sets with u ' 1 j are denoted as "clinically correlated sets."

Fig 2. Flowchart of the analyses for CD4T cells, neutrophils, and monocytes.
Gene expression and DNA methylation profiles were averaged over 25,000 nucleotide regions. Gene expression, DNA methylation and genomic variants were multiplied by themselves to obtain square matrices that are bundled into a tensor; tensor decomposition was then applied. The obtained u ' 1 j were compared with clinical information and used to compute u ' 1 i k to select regions ("max correlated set" (ℓ 1 = 2) and "clinically correlated set" (ℓ 1 = 3) are identified). For gene expression and DNA methylation, TFs that target genes included in the identified regions were selected and validated with enrichment analyses and comparisons with TTRUST2 and Regnetwork. Identified TFs were also compared with TFBSs identified by the selected genomic variants. https://doi.org/10.1371/journal.pone.0289029.g002

Selection of genomic regions and variants, and their biological validation
Following the described procedures, genomic regions and variants were identified together with the included/associated genes for genomic regions and variants using biomaRt and SNPnexus, respectively. Transcription factor-binding sites (TFBSs) associated with genomic variants were also identified using SNPnexus. After collecting the genes identified for individual autosomes, Enrichr was used to identify TFs that targeted genes included in the  T cells RUNX1, SPI1, RELA, TCF3, NELFE, CEBPD, BCL3, PML, ZMIZ1, SRF, GATA1, GATA2, IRF8,  IRF1, MYC, ATF2, TAF7, CHD1, NFE2L2, KLF4, ERG, STAT3, PBX3, KAT2A, FOXA1, CEBPB   Neutrophils SPI1, RUNX1, TCF3, CEBPD, PML, NELFE, RELA, BCL3, SRF, GATA1, CREB1, GATA2, ZMIZ1  genomic regions and validate their biological significance by applying KTD-based unsupervised FE. "Max correlated set": CD4 T cells. We identified 221 and 536 genomic regions for gene expression and DNA methylation, respectively, as well as 1,174,607 genomic variants that were supposed to coincide with the subject profiles represented by u ' 1 j listed in the column "CD4 T Cells" under the "Max correlated set" in Table 1. A total of 419 and 590 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. A total of 14,346 genes were associated with genomic variants. By uploading 419 genes to Enrichr for gene expression analysis, 26 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA & ENCODE consensus" category (Table 4). To validate their biological significance, these 26 TFs were uploaded to Enrichr and found to form a biologically significant set (Table 5). Furthermore, 25 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA 2016" category by uploading 590 genes for DNA methylation to Enrichr (Table 6). These 25 TFs formed a biologically significant set (Table 7). "Max correlated set": Neutrophils. We identified 356 and 154 genomic regions for gene expression and DNA methylation, respectively, and 778,698 genomic variants supposed to be coincident with the subject profiles represented by u ' 1 j listed in the column "Neutrophils" under "Max correlated set" in Table 1. A total of 490 and 500 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 15,356 genes were associated with genomic variants. Of the 490 genes involved in gene expression uploaded to Enrichr, 17 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA & ENCODE consensus" category (Table 4). These 17 TFs formed a biologically significant set (Table 8). Furthermore, by uploading 500 genes identified for DNA methylation, 33 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA 2016" category (Table 6). These 33 TFs formed a biologically significant set (Table 9). "Max correlated set": Monocytes. We identified 182 and 558 genomic regions for gene expression and DNA methylation, respectively, as well as 1,105,748 genomic variants that were supposed to coincide with the subject profiles represented by u ' 1 j in the column "Monocytes" under the "Max correlated set" in Table 1. In total, 453 and 1,015 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively.  Table 4. Upper: Regnetwork web, lower: TTRUST2. Blue genes in Regnetwork web and yellow genes in TTRUST2 are TFs in Table 4. Blue genes in TTRUST2 are associated with these.
https://doi.org/10.1371/journal.pone.0289029.g003 Furthermore, 14,032 genes were associated with genomic variants. Twenty-four TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA & ENCODE consensus" category by uploading 182 genes identified for gene expression (Table 4). These 24 TFs formed a biologically significant set (Table 10). Furthermore, 30 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA 2016" category by uploading 558 genes  Table 6. Upper: Regnetwork web, lower: TTRUST2. Blue genes in Regnetwork web and yellow genes in TTRUST2 are TFs in Table 6. Blue genes in TTRUST2 are associated with these. for DNA methylation (Table 6). These 30 TFs formed a biologically significant set (Table 11).
"Clinically correlated set": CD4 T cells. We identified 425 and 281 genomic regions for gene expression and DNA methylation, respectively as well as 1,073,649 genomic variants that are supposed to coincide with the subject profiles represented by u ' 1 j in the column "CD4 T cell" under "Clinically correlated set" in Table 1. In total, 794 and 412 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 13,178 genes were associated with genomic variants. After uploading 794 genes for gene expression, 36 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA & ENCODE consensus" category (Table 4). These 36 TFs formed a biologically significant set (Table 12). Furthermore, 18 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA 2016" category by uploading 412 genes for DNA  Table 6. Upper: Regnetwork web, lower: TTRUST2. Blue genes in Regnetwork web and yellow genes in TTRUST2 are TFs in Table 6. Blue genes in TTRUST2 are associated with these.
"Clinically correlated set": Neutrophils. We identified 380 and 541 genomic regions for gene expression and DNA methylation, respectively, as well as 63,894 genomic variants that are supposed to coincide with subject profiles represented by u ' 1 j in the column "Neutrophils" under the "Clinically correlated set" in Table 1. A total of 610 and 499 genes were included in the genomic regions selected for gene expression and DNA methylation, respectively. Furthermore, 3,292 genes were associated with genomic variants. By uploading the 610 genes identified for gene expression to Enrichr, 26 TFs with threshold-adjusted P-values less than 0.05 were identified in the "ChEA & ENCODE consensus" category (Table 4). These 26 TFs formed a biologically significant set (Table 14). Furthermore, by uploading 499 genes for DNA methylation, 30 TFs with threshold adjusted P-values less than 0.05 were  Table 4. Upper: Regnetworkweb, lower: TTRUST2. Blue genes in Regnetworkweb and yellow genes in TTRUST2 are TFs in Table 4. Blue genes in TTRUST2 are associated with these.

Discussion
The selected genes were targeted by various TFs enriched in KEGG pathways; thus, genes with profiles coincident with the patient profiles expressed by the selected singular value vectors were biologically valid. The selected KEGG pathways were likely to express the biological properties of the participants' blood cells. If they are intraregulated, the identified genes and variants are probably effective. To determine whether the identified genes were intraregulated, we uploaded the selected TFs to two databases that validated the regulatory relationships between TFs: Regnetworkweb and TRRUST2. Regnetworkweb considers only direct regulatory relationships between TFs whereas TRRUST2 considers indirect regulatory relationships; for example, two TFs targeting the same genes (Figs 3-12). These are clearly  Table 6. Upper: Regnetworkweb, lower: TTRUST2. Blue genes in Regnetworkweb and yellow genes in TTRUST2 are TFs in Table 6. Blue genes in TTRUST2 are associated with these.
To validate the overlap between TFs that target genes identified based on gene expression or DNA methylation and those identified based on TFBS, the total number of human TFs must be determined, and we assume that it is approximately 2000 [27]. Table 16 shows the results of Fisher's test. TFs identified for gene expression significantly overlapped with TFBS associated with genomic variants, even though no significant overlaps were found between the TFBS identified using genomic variants and TFs identified for methylation (not shown here).
One might wonder why we did not compare our performance with that of existing methods. To the best of our knowledge, no other methods are comparable to ours. First, our analysis of association studies between gene expression, methylation, and genomic variants is free from location restrictions; this method can detect any kind of association between these genes, independent of their location along the genome. For example, we can identify interactions between genes and genomic variants that are distant from each other. This is  Table 4. Upper: Regnetworkweb, lower: TTRUST2. Blue genes in Regnetwork web and yellow genes in TTRUST2 are TFs in Table 4. Blue genes in TTRUST2 are associated with these.
https://doi.org/10.1371/journal.pone.0289029.g009 because we can derive the singular value vectors attributed to the subjects, u ' 1 j , at the very beginning of the data analysis flow just after applying TD to x jj 0 k . Genomic regions and/or variants were then selected based on singular value vectors u ' 1 i k attributed to genomic regions or variants i k . Application of TD to x jj 0 k requires a very small amount of computational resources, as x jj 0 k 2 R M�M�K . To our knowledge, no other method can select genomic regions and variants using such a small amount of computational resources. In particular, treating genomic variants is difficult. For gene expression and methylation, these values can be averaged within individual genomic regions, resulting in a reduced dimension of i k , that is, N k . Nevertheless, this cannot be performed for genomic variants because the integers (1, 2, and 3) derived from the genomic variants are arbitrary. Averaging distinct integer numbers attributed to individual genomic variants can destroy the meaning of these integers. Despite this, our method is independent of the size of N k , and can be applied to genomic variants as is. To the best of our knowledge, no other methods can perform this  Table 6. Upper: Regnetwork web, lower: TTRUST2. Blue genes in Regnetwork web and yellow genes in TTRUST2 are TFs in Table 6. Blue genes in TTRUST2 are associated with these.
https://doi.org/10.1371/journal.pone.0289029.g010 task; thus, we could not compare the performance of our method with that of any other method.
Our methods do not require prior knowledge of the subjects. Singular value vectors attributed to the subjects, u ' 1 j , can be generated by applying TD to x jj 0 M , which does not require any additional information about the subjects. The selection of u ' 1 j used to select i K was based on the coincidence between those computed for individual autosomes. Therefore, genomic regions and variants can be selected in a fully unsupervised manner. However, the selected genes were significantly targeted by multiple TFs that were enriched in KEGG pathway terms.
Several biological insights were obtained from this population-based study. One possible application to clinical studies is to compare the outcomes of the present study with those of other clinical studies. Generally, both population-and clinical-based studies have their own biases, and by comparing their outcomes with each other, we can validate their outcomes, which is impossible when only individual outcomes are present.
However, this method had several limitations. First, it is applicable to multiomics datasets that share samples. In addition, because this is an unsupervised method, if there are no significant results in the downward analyses, we have no way to improve the results.